Web Harvesting
Author
Abstract
DEFINITION: Web harvesting is the process of gathering and integrating data from heterogeneous web sources. The required input is an appropriate knowledge representation of the domain of interest (e.g., an ontology), together with example instances of concepts or relationships (seed knowledge). The output is structured data (e.g., in the form of a relational database) gathered from the Web. The term "harvesting" implies that, while passing over a large body of available information, the process gathers only the information that lies in the domain of interest and is therefore relevant.
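As a rough illustration of the definition above, the following Python sketch passes over a few documents, keeps only those that mention seed knowledge, and stores the extracted facts in a relational table. Everything in it is an assumption made for illustration, not part of the source: the `capital_of` relation, the seed pairs, the in-memory `PAGES` strings that stand in for fetched web pages, the regular expression, and the helper names `is_relevant` and `harvest`. A real harvester would start from a proper ontology and use far more robust relevance and extraction methods.

```python
import re
import sqlite3

# Hypothetical seed knowledge: example instances of one relation, capital_of(city, country).
SEEDS = [("Paris", "France"), ("Rome", "Italy")]

# Hypothetical stand-ins for fetched web pages (no network access in this sketch).
PAGES = [
    "Paris is the capital of France. Berlin is the capital of Germany.",
    "Our shop sells shoes and hats at low prices.",
    "Rome is the capital of Italy, and Madrid is the capital of Spain.",
]

# Hypothetical extraction pattern for the relation of interest.
PATTERN = re.compile(r"(\w+) is the capital of (\w+)")


def is_relevant(page, seeds):
    """Treat a page as in-domain if it mentions at least one full seed instance."""
    return any(city in page and country in page for city, country in seeds)


def harvest(pages, seeds):
    """Gather in-domain facts from the pages into an in-memory relational database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE capital_of (city TEXT, country TEXT, UNIQUE(city, country))"
    )
    for page in pages:
        if not is_relevant(page, seeds):
            continue  # skip material outside the domain of interest
        for city, country in PATTERN.findall(page):
            db.execute(
                "INSERT OR IGNORE INTO capital_of VALUES (?, ?)", (city, country)
            )
    return db


db = harvest(PAGES, SEEDS)
rows = db.execute("SELECT city, country FROM capital_of ORDER BY city").fetchall()
for city, country in rows:
    print(city, country)
```

In this toy setup the seed pairs select the relevant pages, and the pattern then also picks up new instances of the relation that appear on those pages, while the unrelated shop page is skipped.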
Similar resources
FOCIH: Form-Based Ontology Creation and Information Harvesting
Creating an ontology and populating it with data are both labor-intensive tasks requiring a high degree of expertise. Thus, scaling ontology creation and population to the size of the web in an effort to create a web of data—which some see as Web 3.0—is prohibitive. Can we find ways to streamline these tasks and lower the barrier enough to enable Web 3.0? Toward this end we offer a form-based a...
A Simple Mechanism for Focused Web-harvesting
Focused web harvesting is deployed to build automated and comprehensive index databases as an alternative approach to virtual topical data integration. Web harvesting has been implemented and extended not only by specifying the targeted URLs, but also by predefining human-edited harvesting parameters to improve speed and accuracy. The harvesting parameter set comprises three main comp...
An Architecture for Selective Web Harvesting: The Use Case of Heritrix
In this paper we provide a brief overview of the crawling architecture of ARCOMEM and how it addresses the challenges arising in the context of selective web harvesting. We describe some of the main technologies developed to perform selective harvesting and we focus on a modified version of the open source crawler Heritrix, which we have adapted to fit in ARCOMEM’s crawling architecture. The si...
Harvesting the Bitexts of the Laws of Hong Kong From the Web
In this paper we present our recent work on harvesting English-Chinese bitexts of the laws of Hong Kong from the Web and aligning them to the subparagraph level by using the numbering system in the legal text hierarchy. Basic methodology and practical techniques are reported in detail. The resultant bilingual corpus, 10.4M English words and 18.3M Chinese characters, is an authoritative and...
Language ID in the Context of Harvesting Language Data off the Web
As the arm of NLP technologies extends beyond a small core of languages, techniques for working with instances of language data across hundreds to thousands of languages may require revisiting and recalibrating the tried and true methods that are used. One of the NLP techniques that has been treated as “solved” is language identification (language ID) of written text. However, we argue that languag...